About the project

Hello data science world! You will find my repository here. I am new to all that, but I have been learning fast, so let’s see how far I can manage with the many bugs here.


1st Week

Tools and methods for open and reproducible research

During the first week we went through the general instructions of the course, deadlines, tools to learn etc.- and start working! We get familiar with R, RStudio and GitHub. We create our own GitHub repository: a place on the web to store your course codes and analyses.

2nd Week

Regression and model validation

This week we are learning how to manage with large datatables and create working size tables; a process called data wrangling. We then focus on data analysis, single and multiple regression analysis and model validation.

Learning dataset

I will be mostly working with data produced by Kimmo Vehkalahti as part of his work for the study of:

“The relationship between learning approaches and students’ achievements in an introductory statistics course in Finland”

  • Preliminary data were presented at a conference in Rio de Janeiro for the 60th World Statistics Congress - ISI2015. Slides are available here.

I am using a shorter version of the dataset after shortening the table to a more manageable size. Here is the structure of the table:

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Here, you can see the same data in a table format for your interest.

Here is a summary table of the above data:

##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

But for better visualisation and understanding of the relationship from the available variables, here is a paired plot:

Description:

  1. There is almost twice as many female students as there are male students participating in this study
  2. Male students are more likely to show high attitude towards statistics
  3. A negative correlation is observed in male participants points scored and their response to deep and surface question. Female student responses show only little negative correlation between points scored and surface questions.
  4. Based on the summary analysis deep, surface, strategic questions and points are normally distributed as their mean and median values are very close. While age distribution seems to be scewed towards younger participants (SD: 21 - 27 years old), with a few older students (maximum 55 years old). This is expected given that the participants are students.

Creating a regression model

Let’s see what is the most significant factor affecting the final exam points of a student.

First, how is the exam points scored affected by the student’s attitude, age and strategic skills (based on their average response to strategic questions).

## 
## Call:
## lm(formula = Points ~ Attitude + Age + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## Attitude     0.34808    0.05622   6.191 4.72e-09 ***
## Age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

It looks like the most important factor affecting the student’s final exam result is purely their attitude (p-value > 0.001)! This is a multivariate regression model. Score on strategic and surface questions show no significant relationship with exam points scorred. The R-squared value of 0.22 implies that the model can explain 22 % or about one-fifth of the variation in the outcome.

Here, is the summary of the linear model with the two parameters alone:

## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

This univariate model explains that the variable of exam points scored is significantly related to the students attitude (p-value > 0). At a theoretical attitude value of zero the points scored would be 11.64 with a standard error 1.83. For every point of attitude increased, there is 0.35 more exam points scored with a standard error of 0.06.

R-squared is always between 0 and 1 (and as percentage when multiplied by 100): 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. Multiple R-squared is used for evaluating how well the model fits the data. In this case, Multiple R-squared value of 0.19 implies that the model can explain only 19% of the variation in the outcome.

So, this is how the model actually looks like:

Graphical model validation

R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the plot() function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing ?plot.lm or help(plot.lm) to the R console.

Residuals vs Fitted plot

The Residuals vs Fitted plot studies the variance of the residuals. This plot represents the deviation of the observed values of an element of a statistical sample from its “theoretical value”. The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest. Here, the linearity of the regression model is supported by a homogeneity of variance (or homoscedasticity) with residuals that appear close to 10 points difference from the fitted values. This is off course with the exception of the observations 145, 56 and 35. * In other words: + We are looking at how well the model describes the variables we are interested in + The target variable is modelled as a linear combination of the model parameters + Errors are normally distributed and have constant variance

Normal Q-Q plot

A Q-Q (quantile-quantile) plot is a probability plot for comparing two probability distributions by plotting their quantiles against each other. Here, the two distributions being compared are the fitted model vs the observed values. Typically the standard deviation of residuals in a sample vary greatly from one datapoint to another even when the errors all have the same standard deviation, thus it does not make sense to compare residuals at different data points without first standartizing. Here, we see that the observations 145, 56 and 35 are again not good representations of the fitted model. Our model here is normally distrubuted and the non fitting observations appear at the extremes of the distribution and are expected to have very little impact on the model (this is studied as part of the next graph).

Residuals vs Leverage plot

Finally, we plotted the standardised residuals against the leverage of the fitted model. Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. Plotting standardised residuals against leverage shows the impact of every single observation on the fitted model. There are some results that have a high impact on the model which suggests that the results are driven by a few data points; that’s what this plot is intended to help us determine. Interestingly, here the most influencial points are again 56 and 35 but 145 does not have very high leverage, but now, 71, a new observation appeared.

Note: A very good explanation came from this link.


3nd Week

Logistic Regression

Student performance dataset

This week, we are working with a new dataset. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks.

high and low alcohol consumption should be mentiond…

You can find out more about this dataset and the related study here.

But here is the variable names included in the following analysis:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

And here is how the data has been structured:

## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Hypothesis

Using the given data, we are going to study student’s behavior with regards to alcohol consumption. According to stereotypes, students who are being raised by one of the two parents alone (Pstatus = A, “apart”) are more likely to seek alcohol to soothen emotional distress. When those students are members of big families (famsize = GT3, “greater than 3”) the situation could lead to higher alcohol consumption. These students are also more likely to miss class (absences are numeric from 0 to 93). While those students who are busy with extracurricular activities (activities = yes) could be distracted, get more supervision and support which leads to less alcohol consumption and less abcences recorded from class. Let’s study now if that is true based on the given data sample.

famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

activities - extra-curricular activities (binary: yes or no)

absences - number of school absences (numeric: from 0 to 93)

4th Week

Clustering and classification

The Boston Dataset

This week we are using the Boston dataset as provided by the package MASS. Harrison and Rubinfeld 1978 published this data for the first time but you can find more information, on a nice layout and simplyfied form, here.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the data structure we can already see that most variables are numeric, the only variable that is integral is called chas and is a Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). The following plots aim to help with the investigation of the data relationships.

## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded

So, let’s standardise the dataset and see how it looks. Data have been standardised with this formula: \[scaled(x)= \frac{x−mean(x)}{sd(x)} \]

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

To make it easier to work with, I am going to transform the crime variable into quantile bins and basically make it a categorical variable.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
## crime
##      low  med_low med_high     high 
##      127      126      126      127
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...

Then, I am going to divide the dataset into two random groups, the train set and the test set. The train set consists of 80 % of the observations.

Linear Discriminant analysis

Linear Discriminant analysis is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable.

Linear discriminant analysis is closely related to many other methods, such as principal component analysis (we will look into that next week) and the already familiar logistic regression.

LDA can be visualized with a biplot. The LDA biplot arrow function used in the exercise is (with slight changes) taken from this Stack Overflow message thread.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2599010 0.2475248 0.2450495 
## 
## Group means:
##                  zn      indus       chas        nox           rm
## low       1.0035978 -0.9680958 -0.1148451 -0.9042817  0.451135609
## med_low  -0.1111464 -0.2393904 -0.1223443 -0.5667784 -0.146847035
## med_high -0.4092038  0.2401365  0.2001230  0.3974445  0.005615594
## high     -0.4872402  1.0171737 -0.1132543  1.0911183 -0.493433724
##                 age        dis        rad        tax     ptratio
## low      -0.8984698  0.9302293 -0.6775274 -0.7172732 -0.47831855
## med_low  -0.3436034  0.3367864 -0.5476413 -0.4490689 -0.03224992
## med_high  0.3958987 -0.3779279 -0.4110831 -0.2752934 -0.23027535
## high      0.8630468 -0.8748867  1.6375616  1.5136504  0.78011702
##                black       lstat        medv
## low       0.37708439 -0.76173034  0.53399349
## med_low   0.34859173 -0.10133433 -0.02086026
## med_high  0.07875764  0.03049081  0.11549219
## high     -0.69871933  0.99044033 -0.73634050
## 
## Coefficients of linear discriminants:
##                  LD1          LD2          LD3
## zn       0.118984613  0.721320868 -0.805895774
## indus   -0.024268549 -0.332094543  0.476827253
## chas    -0.007723292 -0.041780524 -0.059544977
## nox      0.403966820 -0.612733653 -1.485626345
## rm       0.006985266  0.009994678 -0.094195185
## age      0.276535719 -0.243972787 -0.259278622
## dis     -0.126937979 -0.144040056 -0.007749645
## rad      3.125463265  0.933367654  0.067733879
## tax      0.021995574  0.033025691  0.401302437
## ptratio  0.112030665 -0.016774734 -0.263919571
## black   -0.123930425  0.017650941  0.162839109
## lstat    0.213701427 -0.152082442  0.391487987
## medv     0.095947383 -0.373139891 -0.251273242
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9456 0.0402 0.0142

Expectedly the prior probability of groups results to about 1 fourth in each case, since the classes created were based the data divided in quantiles. It seems that LD1 explains 94.5 % of the group variance, I will therefore use `dimen = 2´.

Now, I can better understand the coefficients of linear discriminants based on the arrows on vusualised on the plot. It seems that the variable rad (index of accessibility to radial highways) has the biggest impact on crime (Coefficient of linear discriminants = 3.15 for LD1). Next, comes the variable zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and nox (nitrogen oxides concentration (parts per 10 million)) that separate low from medium crime on a gradient.

Let’s see how is the model performing now. I am going to use the test dataset, the 20 % of the observations that were separated earlier from the main table. Is my model going to predict everything correctly?

##           predicted
## correct    low med_low med_high high
##   low       13      12        2    0
##   med_low    4      10        7    0
##   med_high   1       6       18    1
##   high       0       0        0   28

K-means 10 clusters works best here.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Bonus

Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)


lda.fit2 <- lda(crim ~., data = as.data.frame(boston_scaled2) )
## Warning in lda.default(x, grouping, ...): variables are collinear
# lda.fit2

# visualize the results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 2)

Super-Bonus

Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime )
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color =  classes
            )

bottom of the page